Overview

The wine dataset is a highly popular one in the data science community, as it models some of the challenges of real world datasets and can be modeled by a variety of different model types.

We will first explore the data looking for issues or challenges (i.e. missing data, outliers, possible coding errors, multicollinearlity, etc). Once we have a handle on the data, we will apply any necessary cleaning steps. Once we have a reasonable dataset to work with, we will build and evaluate three different linear models that predict seasonal wins. Our dataset includes both training data and evaluation data - we will train using the main training data, then evaluate models based on how well they perform against the holdout evaluation data. Finally we will select a final model that offers the best compromise between accuracy and simplicity.

1. Data Exploration

Dataset

The wine training set contains 16 columns - including the target variable TARGET - and 12795 rows, covering a variety of different brands of wine. The data-set is entirely numerical variables, but also contains some variables that are highly discrete and have a limited number of possible values. We believe it is still reasonable to treat these as numerical variables since the different values follow a natural numerical order.

Below, we created a chart that describes each variable in the dataset and the theoretical effect it will have on the number of wins projected for a team.

Variables of Interest

Given that the Index column had no impact on the target variable, number of wines, it was dropped.

Summary Stats

We compiled summary statistics on our data set to better understand the data before modeling.

Data summary
Name df
Number of rows 12795
Number of columns 15
_______________________
Column type frequency:
numeric 15
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TARGET 0 1.00 3.03 1.93 0.00 2.00 3.00 4.00 8.00 ▆▇▇▆▁
FixedAcidity 0 1.00 7.08 6.32 -18.10 5.20 6.90 9.50 34.40 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.32 0.78 -2.79 0.13 0.28 0.64 3.68 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.86 -3.24 0.03 0.31 0.58 3.86 ▁▂▇▂▁
ResidualSugar 616 0.95 5.42 33.75 -127.80 -2.00 3.90 15.90 141.15 ▁▂▇▂▁
Chlorides 638 0.95 0.05 0.32 -1.17 -0.03 0.05 0.15 1.35 ▁▂▇▂▁
FreeSulfurDioxide 647 0.95 30.85 148.71 -555.00 0.00 30.00 70.00 623.00 ▁▂▇▂▁
TotalSulfurDioxide 682 0.95 120.71 231.91 -823.00 27.00 123.00 208.00 1057.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 395 0.97 3.21 0.68 0.48 2.96 3.20 3.47 6.13 ▁▂▇▂▁
Sulphates 1210 0.91 0.53 0.93 -3.13 0.28 0.50 0.86 4.24 ▁▂▇▂▁
Alcohol 653 0.95 10.49 3.73 -4.70 9.00 10.40 12.40 26.50 ▁▂▇▂▁
LabelAppeal 0 1.00 -0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.77 1.32 4.00 7.00 8.00 8.00 17.00 ▁▇▁▁▁
STARS 3359 0.74 2.04 0.90 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂

The first observation is the prevalance of NA’s throughout the dataset. Of the 14 feature columns, 8 of them contain at least some NA values. We also see that the TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold” (we would not expect partial cases).

We also note that many of the numerical features measuring the quantity of a chemical in the wine have a negative minimum value. We are assuming the original chemical measurements were normalized (possible a log transform) allowing for negative values, since technically negative concentrations shouldn’t be physically possible. As such, we chose to leave those values as-is.

Distributions

Next, we wanted to get an idea of the distribution profiles for each of the variables.

We see that most variables have a somewhat normal (although steep) distribution.

The distribution profiles show right skew in variables AcidIndex, and STARS.

More interesting is the shape of many features where they are centered with most values clustered at the center and somewhat uniform shape above and below. It almost appears like a tri-modal distribution with a low, middle and high normal distributions overlapping. We are not going to do extensive feature engineer, or we might consider breaking these features up into 3 separate features each. Two approaches include:

  1. Use mixTools to separate the multi-modal curves into 3 distinct (and separate) features each capturing just the low, middle or high values and retaining a numerical value.
  2. Discretize the features converting them into categorical values indicating low, middle or high value.

Boxplots

In addition to creating histogram distributions, we also elected to use box-plots to get an idea of the spread of each variable.

The box-plots do not reveal any enormous outliers in any of the features, meaning it is unlikely we will need to perform outlier detection and removal. AcidIndex, LabelAppeal, and STARS are essentially categorical in nature (ordinal), so we explore how each value of those features relates with TARGET. We see a clear relationship - as LabelAppeal increases, so does TARGET.

We see the same relationship between STARS and TARGET - we especially note that STARS=NA highly correlates with lower TARGET. In the original project instructions, attention was drawn to the fact that missing data might be informative. Based on this, we will impute STARS=NA into STARS=0 which fits with the other values we see for STARS and the pattern that as stars increase, cases sold increase.

Variable Plots

Finally, we wanted to plot scatter plots of each variable versus the target variable, TARGET, to get an idea of the relationship between them.

Due to the discrete nature of the target, it is somewhat difficult to see clear linear relationships in the data. However, it does appear that both STARS and LabelAppeal have a significant positive relationship with the TARGET, and many of the chemical features have at least some negative relationship with the TARGET as lower values led to more values of 8 and 7 in the target variable.

Overall, although our plots indicate some interesting relationships between our variables, they also reveal some significant issues with the data.

For instance, most of the predictor variables are skewed or non-normally distributed, and will need to be transformed. Additionally, there are many data points that contain missing data that will need to be either imputed or discarded. There also was the issue of nonsensical negative values. We have chosen to take the absolute value of these as the feature value as there are so many of these values: however, there is no evidence to back up that this is a correct decision and that these values are not simply missing.

Missing Data

When we initially viewed the first few rows of the raw data, we already noticed missing data. Let’s assess which fields have missing data.

##    values                ind
## 1   26.25              STARS
## 2    9.46          Sulphates
## 3    5.33 TotalSulfurDioxide
## 4    5.10            Alcohol
## 5    5.06  FreeSulfurDioxide
## 6    4.99          Chlorides
## 7    4.81      ResidualSugar
## 8    3.09                 pH
## 9    0.00             TARGET
## 10   0.00       FixedAcidity
## 11   0.00    VolatileAcidity
## 12   0.00         CitricAcid
## 13   0.00            Density
## 14   0.00        LabelAppeal
## 15   0.00          AcidIndex

In the project description, it was noted that the fact that a certain variable is missing may be predictive. We will impute STARS=NA to STARS=0. We will impute the remaining missing data using caret::preProcess and method=knnImpute. Note that preProcess will also center, scale and BoxCox our features at the same time.

Feature-Target Correlations

With our missing data imputed correctly, we can now build off the scatter plots from above to quantify the correlations between our target variable and predictor variable. We will want to choose those with stronger positive or negative correlations. Features with correlations closer to zero will probably not provide any meaningful information on explaining wins by a team.

##          values                ind
## 1   0.685381473              STARS
## 2   0.356500469        LabelAppeal
## 3   0.062030498            Alcohol
## 4   0.051730323 TotalSulfurDioxide
## 5   0.043996542  FreeSulfurDioxide
## 6   0.016187709      ResidualSugar
## 7   0.008684633         CitricAcid
## 8  -0.009081197                 pH
## 9  -0.035589560            Density
## 10 -0.039072231          Chlorides
## 11 -0.039917146          Sulphates
## 12 -0.049010939       FixedAcidity
## 13 -0.088793212    VolatileAcidity
## 14 -0.221991949          AcidIndex

STARS, LabelAppeal, and Alcohol have the highest correlation with TARGET, which matches what we saw in the variable plots above. Recall that we imputed NA values for STARS as 0. WE note that the missing value dummy indicators do not correlate with TARGET, so these additional columns may not provide much additional predictive power. We will include them for now.

Multicolinearity

One problem that can occur with multi-variable regression is correlation between variables, or multicolinearity. A quick check is to run correlations between variables.

We see that the features have very low correlations with each other, meaning that there is not much multicollinearity present in the dataset. This means that the assumptions of linear regression are more likely to be met.

2. Data Preparation

To summarize our data preparation and exploration, we can distinguish our findings into a few categories below:

Removed Fields

We removed the INDEX field as it offers no information for a model.

Missing Values

For the 8 features with missing values, we created a dummy variable that is 1 if the value is missing and 0 if it is not. We then imputed the missing values as the median of the feature, allowing us to using records with missing values while still including the information that the value is missing.

Outliers

Many of the numerical features had unreasonable negative values. We transformed these to be the absolute value of the feature, as a data entry error seems likely given the quantity of negative records.

Transform non-normal variables

Finally, as mentioned earlier in our data exploration, during the impute step, caret::preProcess() automatically centered, scaled and BoxCox transformed our data. Here are some plots to demonstrate the changes in distributions and final values after the transformations:

We see that after the transformations, the variables are more centered and more closely resemble a normal distribution, although clearly they are still not perfect normal distributions.

Finalizing the dataset for model building

With our transformations complete, we can now add these into our clean_df dataframe and continue on to build our models. To better measure each model performance, we split our data into a training and testing data set. We will train using the first, then measure model performance again the testing hold out set.

## [1] "Number of Training Samples:  10238"
## [1] "Number of Testing Samples:  2557"

3. Model Building

Poisson Model 1

In this first model, we include all available features. Features include:

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = trainingData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2019  -0.6471  -0.0069   0.4413   3.6530  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               0.181739   0.320017   0.568
## FixedAcidity                             -0.000411   0.005809  -0.071
## VolatileAcidity                          -0.021742   0.005744  -3.785
## CitricAcid                                0.003765   0.005678   0.663
## ResidualSugar                             0.001296   0.005824   0.223
## Chlorides                                -0.009576   0.005843  -1.639
## FreeSulfurDioxide                         0.013779   0.005750   2.396
## TotalSulfurDioxide                        0.019508   0.005882   3.317
## Density                                  -0.007099   0.005748  -1.235
## pH                                       -0.008008   0.005786  -1.384
## Sulphates                                -0.014142   0.005935  -2.383
## Alcohol                                   0.013964   0.005870   2.379
## as.factor(LabelAppeal)-1.11204793733397   0.254481   0.042766   5.951
## as.factor(LabelAppeal)0.0101741115806247  0.442918   0.041730  10.614
## as.factor(LabelAppeal)1.13239616049522    0.579996   0.042436  13.668
## as.factor(LabelAppeal)2.25461820940981    0.715625   0.047786  14.976
## as.factor(AcidIndex)-3.59682937695875    -0.370876   0.324680  -1.142
## as.factor(AcidIndex)-1.79176983045029    -0.298840   0.317599  -0.941
## as.factor(AcidIndex)-0.545318540973785   -0.319948   0.317282  -1.008
## as.factor(AcidIndex)0.362910765511677    -0.348949   0.317343  -1.100
## as.factor(AcidIndex)1.05172974217783     -0.463021   0.317757  -1.457
## as.factor(AcidIndex)1.59059728918163     -0.600379   0.319144  -1.881
## as.factor(AcidIndex)2.02271372429848     -0.993138   0.323403  -3.071
## as.factor(AcidIndex)2.37629509167962     -1.060688   0.331479  -3.200
## as.factor(AcidIndex)2.67051656830802     -0.870580   0.334070  -2.606
## as.factor(AcidIndex)2.9188445277671      -0.868512   0.344947  -2.518
## as.factor(AcidIndex)3.13100139587667     -0.461426   0.437919  -1.054
## as.factor(AcidIndex)3.31417429494859     -1.140727   0.548705  -2.079
## as.factor(AcidIndex)3.47378568897179     -1.268681   0.548933  -2.311
## as.factor(STARS)-0.42623524866846         0.751031   0.021945  34.223
## as.factor(STARS)0.416552574962037         1.065094   0.020481  52.003
## as.factor(STARS)1.25934039859254          1.185733   0.021509  55.127
## as.factor(STARS)2.10212822222303          1.315180   0.027314  48.150
##                                                      Pr(>|z|)    
## (Intercept)                                          0.570100    
## FixedAcidity                                         0.943600    
## VolatileAcidity                                      0.000154 ***
## CitricAcid                                           0.507247    
## ResidualSugar                                        0.823916    
## Chlorides                                            0.101242    
## FreeSulfurDioxide                                    0.016559 *  
## TotalSulfurDioxide                                   0.000911 ***
## Density                                              0.216792    
## pH                                                   0.166379    
## Sulphates                                            0.017181 *  
## Alcohol                                              0.017366 *  
## as.factor(LabelAppeal)-1.11204793733397         0.00000000267 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                0.253336    
## as.factor(AcidIndex)-1.79176983045029                0.346738    
## as.factor(AcidIndex)-0.545318540973785               0.313261    
## as.factor(AcidIndex)0.362910765511677                0.271508    
## as.factor(AcidIndex)1.05172974217783                 0.145073    
## as.factor(AcidIndex)1.59059728918163                 0.059943 .  
## as.factor(AcidIndex)2.02271372429848                 0.002134 ** 
## as.factor(AcidIndex)2.37629509167962                 0.001375 ** 
## as.factor(AcidIndex)2.67051656830802                 0.009161 ** 
## as.factor(AcidIndex)2.9188445277671                  0.011809 *  
## as.factor(AcidIndex)3.13100139587667                 0.292030    
## as.factor(AcidIndex)3.31417429494859                 0.037622 *  
## as.factor(AcidIndex)3.47378568897179                 0.020823 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18304  on 10237  degrees of freedom
## Residual deviance: 10852  on 10205  degrees of freedom
## AIC: 36482
## 
## Number of Fisher Scoring iterations: 6
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5914580     0.5186798     2.2269670 36481.6428640 36720.3602957

Poisson Model 2

In this second model, we only include the most predictive features based on our first Poisson Model. The predictors for the following model are:

VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), family = poisson, data = clean_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2335  -0.6501  -0.0035   0.4410   3.6951  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               0.001201   0.318675   0.004
## VolatileAcidity                          -0.023533   0.005122  -4.595
## TotalSulfurDioxide                        0.016952   0.005244   3.232
## Alcohol                                   0.016009   0.005231   3.060
## as.factor(LabelAppeal)-1.11204793733397   0.240289   0.037999   6.324
## as.factor(LabelAppeal)0.0101741115806247  0.430806   0.037064  11.623
## as.factor(LabelAppeal)1.13239616049522    0.564104   0.037710  14.959
## as.factor(LabelAppeal)2.25461820940981    0.699216   0.042446  16.473
## as.factor(AcidIndex)-3.59682937695875    -0.140143   0.322337  -0.435
## as.factor(AcidIndex)-1.79176983045029    -0.097789   0.316887  -0.309
## as.factor(AcidIndex)-0.545318540973785   -0.131342   0.316605  -0.415
## as.factor(AcidIndex)0.362910765511677    -0.162635   0.316637  -0.514
## as.factor(AcidIndex)1.05172974217783     -0.272917   0.316940  -0.861
## as.factor(AcidIndex)1.59059728918163     -0.432064   0.318025  -1.359
## as.factor(AcidIndex)2.02271372429848     -0.795786   0.321596  -2.474
## as.factor(AcidIndex)2.37629509167962     -0.810861   0.327262  -2.478
## as.factor(AcidIndex)2.67051656830802     -0.646459   0.330156  -1.958
## as.factor(AcidIndex)2.9188445277671      -0.738613   0.342710  -2.155
## as.factor(AcidIndex)3.13100139587667     -0.286638   0.403433  -0.710
## as.factor(AcidIndex)3.31417429494859     -0.951704   0.547993  -1.737
## as.factor(AcidIndex)3.47378568897179     -1.196936   0.548071  -2.184
## as.factor(STARS)-0.42623524866846         0.757245   0.019564  38.705
## as.factor(STARS)0.416552574962037         1.075518   0.018261  58.896
## as.factor(STARS)1.25934039859254          1.194174   0.019232  62.093
## as.factor(STARS)2.10212822222303          1.313789   0.024330  53.999
##                                                      Pr(>|z|)    
## (Intercept)                                           0.99699    
## VolatileAcidity                                0.000004335408 ***
## TotalSulfurDioxide                                    0.00123 ** 
## Alcohol                                               0.00221 ** 
## as.factor(LabelAppeal)-1.11204793733397        0.000000000256 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.66373    
## as.factor(AcidIndex)-1.79176983045029                 0.75763    
## as.factor(AcidIndex)-0.545318540973785                0.67825    
## as.factor(AcidIndex)0.362910765511677                 0.60751    
## as.factor(AcidIndex)1.05172974217783                  0.38918    
## as.factor(AcidIndex)1.59059728918163                  0.17428    
## as.factor(AcidIndex)2.02271372429848                  0.01334 *  
## as.factor(AcidIndex)2.37629509167962                  0.01322 *  
## as.factor(AcidIndex)2.67051656830802                  0.05023 .  
## as.factor(AcidIndex)2.9188445277671                   0.03114 *  
## as.factor(AcidIndex)3.13100139587667                  0.47740    
## as.factor(AcidIndex)3.31417429494859                  0.08244 .  
## as.factor(AcidIndex)3.47378568897179                  0.02897 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 13550  on 12770  degrees of freedom
## AIC: 45542
## 
## Number of Fisher Scoring iterations: 6
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5911736     0.5169439     2.2257546 45542.3183589 45728.7386026

Negative Binomial Model 3

Similar to Poisson Model 1, the predictors for the following model are:

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = clean_df, 
##     init.theta = 40957.00204, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2219  -0.6496  -0.0055   0.4446   3.6790  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                               0.0275198  0.3190721   0.086
## FixedAcidity                              0.0010176  0.0051829   0.196
## VolatileAcidity                          -0.0231944  0.0051231  -4.527
## CitricAcid                                0.0039724  0.0050838   0.781
## ResidualSugar                             0.0005767  0.0052060   0.111
## Chlorides                                -0.0122552  0.0052206  -2.347
## FreeSulfurDioxide                         0.0132531  0.0051832   2.557
## TotalSulfurDioxide                        0.0170166  0.0052484   3.242
## Density                                  -0.0074549  0.0050938  -1.464
## pH                                       -0.0066655  0.0051875  -1.285
## Sulphates                                -0.0106282  0.0053149  -2.000
## Alcohol                                   0.0157805  0.0052355   3.014
## as.factor(LabelAppeal)-1.11204793733397   0.2398534  0.0380017   6.312
## as.factor(LabelAppeal)0.0101741115806247  0.4300463  0.0370666  11.602
## as.factor(LabelAppeal)1.13239616049522    0.5633460  0.0377142  14.937
## as.factor(LabelAppeal)2.25461820940981    0.6992610  0.0424548  16.471
## as.factor(AcidIndex)-3.59682937695875    -0.1651538  0.3226593  -0.512
## as.factor(AcidIndex)-1.79176983045029    -0.1214774  0.3172467  -0.383
## as.factor(AcidIndex)-0.545318540973785   -0.1558113  0.3169954  -0.492
## as.factor(AcidIndex)0.362910765511677    -0.1871144  0.3170413  -0.590
## as.factor(AcidIndex)1.05172974217783     -0.2972766  0.3173791  -0.937
## as.factor(AcidIndex)1.59059728918163     -0.4542592  0.3184811  -1.426
## as.factor(AcidIndex)2.02271372429848     -0.8158352  0.3220684  -2.533
## as.factor(AcidIndex)2.37629509167962     -0.8303961  0.3277299  -2.534
## as.factor(AcidIndex)2.67051656830802     -0.6688133  0.3306330  -2.023
## as.factor(AcidIndex)2.9188445277671      -0.7687131  0.3432641  -2.239
## as.factor(AcidIndex)3.13100139587667     -0.3297889  0.4038365  -0.817
## as.factor(AcidIndex)3.31417429494859     -0.9814037  0.5484760  -1.789
## as.factor(AcidIndex)3.47378568897179     -1.2022430  0.5486104  -2.191
## as.factor(STARS)-0.42623524866846         0.7548590  0.0195728  38.567
## as.factor(STARS)0.416552574962037         1.0732229  0.0182738  58.730
## as.factor(STARS)1.25934039859254          1.1910222  0.0192473  61.880
## as.factor(STARS)2.10212822222303          1.3117031  0.0243440  53.882
##                                                      Pr(>|z|)    
## (Intercept)                                           0.93127    
## FixedAcidity                                          0.84435    
## VolatileAcidity                                0.000005971523 ***
## CitricAcid                                            0.43458    
## ResidualSugar                                         0.91179    
## Chlorides                                             0.01890 *  
## FreeSulfurDioxide                                     0.01056 *  
## TotalSulfurDioxide                                    0.00119 ** 
## Density                                               0.14332    
## pH                                                    0.19882    
## Sulphates                                             0.04554 *  
## Alcohol                                               0.00258 ** 
## as.factor(LabelAppeal)-1.11204793733397        0.000000000276 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.60875    
## as.factor(AcidIndex)-1.79176983045029                 0.70179    
## as.factor(AcidIndex)-0.545318540973785                0.62305    
## as.factor(AcidIndex)0.362910765511677                 0.55506    
## as.factor(AcidIndex)1.05172974217783                  0.34893    
## as.factor(AcidIndex)1.59059728918163                  0.15377    
## as.factor(AcidIndex)2.02271372429848                  0.01131 *  
## as.factor(AcidIndex)2.37629509167962                  0.01128 *  
## as.factor(AcidIndex)2.67051656830802                  0.04309 *  
## as.factor(AcidIndex)2.9188445277671                   0.02513 *  
## as.factor(AcidIndex)3.13100139587667                  0.41413    
## as.factor(AcidIndex)3.31417429494859                  0.07356 .  
## as.factor(AcidIndex)3.47378568897179                  0.02842 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40957) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13529  on 12762  degrees of freedom
## AIC: 45540
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40957 
##           Std. Err.:  34344 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45472.13
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5906992     0.5182237     2.2252109 45540.1321552 45793.6636866

Negative Binomial Model 4

Similar to Poisson Model 2, the predictors for the following model are:

VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), data = clean_df, init.theta = 40912.67371, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2456  -0.6517  -0.0038   0.4399   3.6952  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               0.014317   0.318731   0.045
## VolatileAcidity                          -0.023451   0.005122  -4.578
## FreeSulfurDioxide                         0.013266   0.005179   2.562
## TotalSulfurDioxide                        0.016879   0.005244   3.219
## Alcohol                                   0.016237   0.005231   3.104
## as.factor(LabelAppeal)-1.11204793733397   0.240039   0.038000   6.317
## as.factor(LabelAppeal)0.0101741115806247  0.430314   0.037065  11.610
## as.factor(LabelAppeal)1.13239616049522    0.563287   0.037712  14.936
## as.factor(LabelAppeal)2.25461820940981    0.698162   0.042449  16.447
## as.factor(AcidIndex)-3.59682937695875    -0.154135   0.322401  -0.478
## as.factor(AcidIndex)-1.79176983045029    -0.110580   0.316945  -0.349
## as.factor(AcidIndex)-0.545318540973785   -0.143771   0.316660  -0.454
## as.factor(AcidIndex)0.362910765511677    -0.174856   0.316690  -0.552
## as.factor(AcidIndex)1.05172974217783     -0.285228   0.316994  -0.900
## as.factor(AcidIndex)1.59059728918163     -0.443142   0.318072  -1.393
## as.factor(AcidIndex)2.02271372429848     -0.806122   0.321638  -2.506
## as.factor(AcidIndex)2.37629509167962     -0.819576   0.327296  -2.504
## as.factor(AcidIndex)2.67051656830802     -0.655682   0.330193  -1.986
## as.factor(AcidIndex)2.9188445277671      -0.753474   0.342775  -2.198
## as.factor(AcidIndex)3.13100139587667     -0.299583   0.403483  -0.742
## as.factor(AcidIndex)3.31417429494859     -0.953685   0.548008  -1.740
## as.factor(AcidIndex)3.47378568897179     -1.205542   0.548094  -2.200
## as.factor(STARS)-0.42623524866846         0.756497   0.019567  38.662
## as.factor(STARS)0.416552574962037         1.074730   0.018264  58.844
## as.factor(STARS)1.25934039859254          1.193602   0.019234  62.057
## as.factor(STARS)2.10212822222303          1.314420   0.024331  54.023
##                                                      Pr(>|z|)    
## (Intercept)                                           0.96417    
## VolatileAcidity                                0.000004685011 ***
## FreeSulfurDioxide                                     0.01042 *  
## TotalSulfurDioxide                                    0.00129 ** 
## Alcohol                                               0.00191 ** 
## as.factor(LabelAppeal)-1.11204793733397        0.000000000267 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.63259    
## as.factor(AcidIndex)-1.79176983045029                 0.72717    
## as.factor(AcidIndex)-0.545318540973785                0.64981    
## as.factor(AcidIndex)0.362910765511677                 0.58086    
## as.factor(AcidIndex)1.05172974217783                  0.36823    
## as.factor(AcidIndex)1.59059728918163                  0.16356    
## as.factor(AcidIndex)2.02271372429848                  0.01220 *  
## as.factor(AcidIndex)2.37629509167962                  0.01228 *  
## as.factor(AcidIndex)2.67051656830802                  0.04706 *  
## as.factor(AcidIndex)2.9188445277671                   0.02794 *  
## as.factor(AcidIndex)3.13100139587667                  0.45779    
## as.factor(AcidIndex)3.31417429494859                  0.08181 .  
## as.factor(AcidIndex)3.47378568897179                  0.02784 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40912.67) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13543  on 12769  degrees of freedom
## AIC: 45540
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40913 
##           Std. Err.:  34297 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45486.18
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5909365     0.5174052     2.2255553 45540.1774667 45741.5113299

Linear Model 5

The predictors for the following model are:

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9635 -0.8591  0.0325  0.8384  6.0750 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                               0.995095   0.755491   1.317
## FixedAcidity                              0.004930   0.011720   0.421
## VolatileAcidity                          -0.073667   0.011565  -6.370
## CitricAcid                                0.014455   0.011558   1.251
## ResidualSugar                             0.004027   0.011782   0.342
## Chlorides                                -0.038485   0.011776  -3.268
## FreeSulfurDioxide                         0.039825   0.011786   3.379
## TotalSulfurDioxide                        0.049447   0.011816   4.185
## Density                                  -0.022475   0.011545  -1.947
## pH                                       -0.018619   0.011704  -1.591
## Sulphates                                -0.028248   0.012015  -2.351
## Alcohol                                   0.050812   0.011830   4.295
## as.factor(LabelAppeal)-1.11204793733397   0.367639   0.062729   5.861
## as.factor(LabelAppeal)0.0101741115806247  0.835185   0.061168  13.654
## as.factor(LabelAppeal)1.13239616049522    1.302062   0.063917  20.371
## as.factor(LabelAppeal)2.25461820940981    1.889951   0.084169  22.454
## as.factor(AcidIndex)-3.59682937695875    -0.334854   0.767938  -0.436
## as.factor(AcidIndex)-1.79176983045029    -0.220221   0.754101  -0.292
## as.factor(AcidIndex)-0.545318540973785   -0.322349   0.753472  -0.428
## as.factor(AcidIndex)0.362910765511677    -0.429363   0.753539  -0.570
## as.factor(AcidIndex)1.05172974217783     -0.732560   0.754117  -0.971
## as.factor(AcidIndex)1.59059728918163     -1.041297   0.755385  -1.378
## as.factor(AcidIndex)2.02271372429848     -1.513113   0.757752  -1.997
## as.factor(AcidIndex)2.37629509167962     -1.533481   0.762156  -2.012
## as.factor(AcidIndex)2.67051656830802     -1.552014   0.769606  -2.017
## as.factor(AcidIndex)2.9188445277671      -1.400154   0.777299  -1.801
## as.factor(AcidIndex)3.13100139587667     -0.692206   0.883131  -0.784
## as.factor(AcidIndex)3.31417429494859     -1.772148   0.952843  -1.860
## as.factor(AcidIndex)3.47378568897179     -1.920432   0.900840  -2.132
## as.factor(STARS)-0.42623524866846         1.346560   0.032920  40.904
## as.factor(STARS)0.416552574962037         2.381720   0.032021  74.381
## as.factor(STARS)1.25934039859254          2.942287   0.037079  79.352
## as.factor(STARS)2.10212822222303          3.629958   0.059150  61.368
##                                                      Pr(>|t|)    
## (Intercept)                                           0.18781    
## FixedAcidity                                          0.67401    
## VolatileAcidity                                0.000000000196 ***
## CitricAcid                                            0.21109    
## ResidualSugar                                         0.73251    
## Chlorides                                             0.00109 ** 
## FreeSulfurDioxide                                     0.00073 ***
## TotalSulfurDioxide                             0.000028712970 ***
## Density                                               0.05160 .  
## pH                                                    0.11167    
## Sulphates                                             0.01873 *  
## Alcohol                                        0.000017590098 ***
## as.factor(LabelAppeal)-1.11204793733397        0.000000004722 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.66281    
## as.factor(AcidIndex)-1.79176983045029                 0.77027    
## as.factor(AcidIndex)-0.545318540973785                0.66879    
## as.factor(AcidIndex)0.362910765511677                 0.56883    
## as.factor(AcidIndex)1.05172974217783                  0.33136    
## as.factor(AcidIndex)1.59059728918163                  0.16807    
## as.factor(AcidIndex)2.02271372429848                  0.04586 *  
## as.factor(AcidIndex)2.37629509167962                  0.04424 *  
## as.factor(AcidIndex)2.67051656830802                  0.04375 *  
## as.factor(AcidIndex)2.9188445277671                   0.07168 .  
## as.factor(AcidIndex)3.13100139587667                  0.43317    
## as.factor(AcidIndex)3.31417429494859                  0.06293 .  
## as.factor(AcidIndex)3.47378568897179                  0.03304 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.302 on 12762 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5429 
## F-statistic: 475.9 on 32 and 12762 DF,  p-value: < 0.00000000000000022
##          RMSE      Rsquared           MAE           aic           bic 
##     1.3049443     0.5419395     1.0186507 43106.3022739 43359.8338054

Linear Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model #5 to choose the most important features.

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS), 
##     data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9616 -0.8590  0.0352  0.8399  6.0675 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                               0.99019    0.75488   1.312
## VolatileAcidity                          -0.07393    0.01156  -6.394
## Chlorides                                -0.03864    0.01177  -3.282
## FreeSulfurDioxide                         0.04009    0.01178   3.403
## TotalSulfurDioxide                        0.04960    0.01181   4.200
## Density                                  -0.02270    0.01154  -1.967
## pH                                       -0.01862    0.01170  -1.591
## Sulphates                                -0.02837    0.01201  -2.362
## Alcohol                                   0.05100    0.01183   4.313
## as.factor(LabelAppeal)-1.11204793733397   0.36722    0.06272   5.854
## as.factor(LabelAppeal)0.0101741115806247  0.83483    0.06116  13.649
## as.factor(LabelAppeal)1.13239616049522    1.30161    0.06391  20.367
## as.factor(LabelAppeal)2.25461820940981    1.88998    0.08416  22.456
## as.factor(AcidIndex)-3.59682937695875    -0.33511    0.76744  -0.437
## as.factor(AcidIndex)-1.79176983045029    -0.21824    0.75353  -0.290
## as.factor(AcidIndex)-0.545318540973785   -0.31837    0.75287  -0.423
## as.factor(AcidIndex)0.362910765511677    -0.42442    0.75293  -0.564
## as.factor(AcidIndex)1.05172974217783     -0.72561    0.75343  -0.963
## as.factor(AcidIndex)1.59059728918163     -1.03390    0.75463  -1.370
## as.factor(AcidIndex)2.02271372429848     -1.50407    0.75695  -1.987
## as.factor(AcidIndex)2.37629509167962     -1.52217    0.76131  -1.999
## as.factor(AcidIndex)2.67051656830802     -1.54018    0.76867  -2.004
## as.factor(AcidIndex)2.9188445277671      -1.38512    0.77635  -1.784
## as.factor(AcidIndex)3.13100139587667     -0.67937    0.88243  -0.770
## as.factor(AcidIndex)3.31417429494859     -1.75343    0.95178  -1.842
## as.factor(AcidIndex)3.47378568897179     -1.89498    0.89978  -2.106
## as.factor(STARS)-0.42623524866846         1.34682    0.03291  40.918
## as.factor(STARS)0.416552574962037         2.38256    0.03201  74.442
## as.factor(STARS)1.25934039859254          2.94276    0.03707  79.374
## as.factor(STARS)2.10212822222303          3.63105    0.05914  61.397
##                                                      Pr(>|t|)    
## (Intercept)                                          0.189639    
## VolatileAcidity                                0.000000000167 ***
## Chlorides                                            0.001034 ** 
## FreeSulfurDioxide                                    0.000669 ***
## TotalSulfurDioxide                             0.000026926678 ***
## Density                                              0.049224 *  
## pH                                                   0.111585    
## Sulphates                                            0.018191 *  
## Alcohol                                        0.000016245384 ***
## as.factor(LabelAppeal)-1.11204793733397        0.000000004904 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                0.662366    
## as.factor(AcidIndex)-1.79176983045029                0.772104    
## as.factor(AcidIndex)-0.545318540973785               0.672396    
## as.factor(AcidIndex)0.362910765511677                0.572976    
## as.factor(AcidIndex)1.05172974217783                 0.335525    
## as.factor(AcidIndex)1.59059728918163                 0.170687    
## as.factor(AcidIndex)2.02271372429848                 0.046941 *  
## as.factor(AcidIndex)2.37629509167962                 0.045584 *  
## as.factor(AcidIndex)2.67051656830802                 0.045124 *  
## as.factor(AcidIndex)2.9188445277671                  0.074426 .  
## as.factor(AcidIndex)3.13100139587667                 0.441383    
## as.factor(AcidIndex)3.31417429494859                 0.065461 .  
## as.factor(AcidIndex)3.47378568897179                 0.035220 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.302 on 12765 degrees of freedom
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.543 
## F-statistic: 525.1 on 29 and 12765 DF,  p-value: < 0.00000000000000022
##          RMSE      Rsquared           MAE           aic           bic 
##     1.3050053     0.5418966     1.0186907 43102.1597330 43333.3208352

4. Model Selection & Analysis

This table summarizes the RMSE, \(R^2\), MAE, AIC and BIC for all 6 models. In terms of raw metrics, The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and \(R^2\); however, Poisson Model 1 had the best performance based on AIC and BIC.

Overall, RMSE an \(R^2\) were not largely different across the 6 models, given this we chose Poisson Model 1 as our final model since it had a far lower AIC.

RMSE Rsquared MAE aic bic
poiss1_eval 2.591458 0.5186798 2.226967 36481.64 36720.36
poiss2_eval 2.591174 0.5169439 2.225755 45542.32 45728.74
nb3_eval 2.590699 0.5182237 2.225211 45540.13 45793.66
nb4_eval 2.590937 0.5174052 2.225555 45540.18 45741.51
lm5_eval 1.304944 0.5419395 1.018651 43106.30 43359.83
lm6_eval 1.305005 0.5418966 1.018691 43102.16 43333.32

The final thing we will take a look at is our variable importance for each model.

The final figure presented below shows the feature importance in each model. As we see, all 6 models identified the same top 10 features.

Predictions

We apply Poisson Model #1 to the holdout evaluation set to predict the TARGET for these instances. We have saved these predictions as csv in the file eval_predictions.csv.

Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_5/eval_predictions.csv

##       TARGET FixedAcidity VolatileAcidity  CitricAcid ResidualSugar   Chlorides
## 1 0.17715256  -0.26524401     -1.51030924 -0.04455813    -0.4776009  0.11673887
## 2 1.35023830   0.84276407      0.07767213 -1.23934313    -0.7442725  3.49856180
## 3 0.93886293   0.01967235      1.81871194 -0.16055667    -1.1383538  0.03195779
## 4 0.84639842  -0.13861451     -0.28584168  1.73021959    -0.1309278 -0.73421194
## 5 0.01159925   0.68447720     -0.14553811 -0.03295827    -0.1250018 -0.05282329
## 6 1.71495385   1.66585579     -0.36237090 -1.69173745    -0.1190758  1.50777652
##   FreeSulfurDioxide TotalSulfurDioxide    Density          pH  Sulphates
## 1       -0.05275591         1.19564455 -0.3402128  2.66647966  0.1211079
## 2       -0.45621338        -0.22730155 -0.1442311  0.23889195  0.6038736
## 3       -0.14689598        -0.19280589  1.9789246  2.06326090  0.1640204
## 4        0.49191169        -0.13675044 -0.2085894 -0.01122314  1.6981424
## 5        0.26328578        -0.29198092  1.3139028 -0.98225823 -0.6405890
## 6       -1.88848742         0.08315942 -1.6483055 -0.21720028 -0.5869484
##      Alcohol LabelAppeal  AcidIndex      STARS
## 1  0.4857435 -1.11204794 -1.7917698 -1.2690231
## 2  1.4782809  0.01017411 -1.7917698  0.4165526
## 3 -0.5202067  0.01017411  0.3629108 -0.4262352
## 4  0.4857435 -1.11204794  0.3629108 -0.4262352
## 5 -1.5261568  0.01017411  1.5905973 -1.2690231
## 6  0.2443154  1.13239616  0.3629108  2.1021282

References

Appendix

R Code

[[[INSERT R CODE HERE ONCE ALL CHANGE DONE]]]